Screenshot 2024-07-27 at 1.39.35 PM.png

Operative: ETL Pipeline¶

Jonathan Sax
Northwestern University

In [1]:
import pandas as pd
import timeit
import psycopg2
import string
from sqlalchemy import create_engine
import mysql.connector 
import pymysql
import matplotlib.pyplot as plt
import requests
from collections import Counter
In [3]:
# Database connection 

postgres_params_local = {
    'user': 'postgres',
    'database': 'fall2024',
    'password': '',
    'host': 'localhost',
    'port': '5432',
}
In [4]:
# Establish a connection
conn = psycopg2.connect(**postgres_params_local)
In [5]:
conn
Out[5]:
<connection object at 0x12aa9e650; dsn: 'user=postgres password=xxx dbname=fall2024 host=localhost port=5432', closed: 0>

Step 1. EXTRACT.¶

Each agent's report is downloaded from Gigwell as a CSV file. This is done to remove confidential information like artists' performance fees, agent's commissions, deposit schedules, etc.

In [13]:
# Path to raw CSV file
jon_gigwell_file_path = "/Users/hydraulicdecoy/Desktop/Operative Analytics/Liaison-Artists-Booking-10-4-24.csv"

# Load CSV file into a pandas dataframe
df1 = pd.read_csv(jon_gigwell_file_path)
In [14]:
df1
Out[14]:
Event Date Artists Venue City Venue Country Signer Signer Email Buyer
0 1/1/23 Mary Droppinz Petco Park San Diego United States Kevin Wiles kevin@crssd.com CRSSD Music, LLC dba FNGRS CRSSD
1 1/1/23 Guy J 3 Dollar Bill Brooklyn United States Boris N borisnotthedj@gmail.com Inoki Management LLC
2 1/1/23 Ardalan Adventure Hornblower San Diego United States Jimbo James jimbo@lovelifeparty.com Music is 4 Lovers, LLC
3 1/1/23 Kevin De Vries Petco Park San Diego United States Kevin Wiles sophie@crssd.com CRSSD Music, LLC
4 1/1/23 Matador Ritoque Maipú Chile Claudio Agosin claudio.agosin@gmail.com Inversiones m y c
... ... ... ... ... ... ... ... ...
833 2/23/25 Victor Ruiz Autódromo Hermanos Rodríguez Iztacalco Mexico MICHELLE TREJO mtrejoa@ocesa.mx OCESA PROMOTORA SA DE CV (TAX ID: DEE-031112-SI7)
834 3/1/25 PAN-POT Knockdown Center Queens United States Tyler Myers tea@knockdowncenter.com DMI, Inc DBA Knockdown Center
835 3/14/25 Julya Karma Audio Nightclub San Francisco United States Peter Doukakis pselectsf@gmail.com Audio Nightclub
836 3/15/25 Rodriguez Jr. Majahuitas Puerto Vallarta Mexico Ricardo Farkas rfarkas@vallarta-adventures.com VALLARTA ADVENTURE SA DE CV
837 5/24/25 Rodriguez Jr. Private Residence SF San Francisco United States Jeremy Ringel ringelj6@yahoo.com Jeremy Ringel

838 rows × 8 columns

Step 2. TRANSFORM.¶

Rename the 'Buyer' column 'Promoter' as this is more colloquial for users (agents).

In [15]:
df1.rename(columns = {'Buyer':'Promoter'}, inplace = True)
In [16]:
df1
Out[16]:
Event Date Artists Venue City Venue Country Signer Signer Email Promoter
0 1/1/23 Mary Droppinz Petco Park San Diego United States Kevin Wiles kevin@crssd.com CRSSD Music, LLC dba FNGRS CRSSD
1 1/1/23 Guy J 3 Dollar Bill Brooklyn United States Boris N borisnotthedj@gmail.com Inoki Management LLC
2 1/1/23 Ardalan Adventure Hornblower San Diego United States Jimbo James jimbo@lovelifeparty.com Music is 4 Lovers, LLC
3 1/1/23 Kevin De Vries Petco Park San Diego United States Kevin Wiles sophie@crssd.com CRSSD Music, LLC
4 1/1/23 Matador Ritoque Maipú Chile Claudio Agosin claudio.agosin@gmail.com Inversiones m y c
... ... ... ... ... ... ... ... ...
833 2/23/25 Victor Ruiz Autódromo Hermanos Rodríguez Iztacalco Mexico MICHELLE TREJO mtrejoa@ocesa.mx OCESA PROMOTORA SA DE CV (TAX ID: DEE-031112-SI7)
834 3/1/25 PAN-POT Knockdown Center Queens United States Tyler Myers tea@knockdowncenter.com DMI, Inc DBA Knockdown Center
835 3/14/25 Julya Karma Audio Nightclub San Francisco United States Peter Doukakis pselectsf@gmail.com Audio Nightclub
836 3/15/25 Rodriguez Jr. Majahuitas Puerto Vallarta Mexico Ricardo Farkas rfarkas@vallarta-adventures.com VALLARTA ADVENTURE SA DE CV
837 5/24/25 Rodriguez Jr. Private Residence SF San Francisco United States Jeremy Ringel ringelj6@yahoo.com Jeremy Ringel

838 rows × 8 columns

Now it's time to create the four tables. I will create them as pandas dataframes for quick/easy data verification first, then I will import them into the postgreSQL database.¶

In [17]:
# Normalize the Artists table
artists_df = df1[['Artists']].drop_duplicates().reset_index(drop=True)
artists_df['artistid'] = artists_df.index + 1  # Assigning a unique ID

# Normalize the Venues table
venues_df = df1[['Venue', 'City', 'Venue Country']].drop_duplicates().reset_index(drop=True)
venues_df['venueid'] = venues_df.index + 1  # Assigning a unique ID

# Normalize the Promoters table
promoters_df = df1[['Promoter', 'Signer', 'Signer Email']].drop_duplicates().reset_index(drop=True)
promoters_df['promoterid'] = promoters_df.index + 1  # Assigning a unique ID
promoters_df['active'] = True  # Defaulting 'active' to True for now, as we are only pulling 2023-2025 booking data.

Now it's time to create the events DataFrame with Foreign Keys referencing artistid, venueid, and promoterid.¶

In [18]:
# Merge to create the events table with foreign keys
events_df = df1.merge(artists_df, on='Artists')\
              .merge(venues_df, on=['Venue', 'City', 'Venue Country'])\
              .merge(promoters_df, on=['Promoter', 'Signer', 'Signer Email'])

# Select and reorder columns to match the final schema
events_df = events_df[['Event Date', 'Artists', 'artistid', 'Venue', 'venueid', 'City', 'Venue Country', 'Promoter', 'promoterid']]

# Add a unique eventid column
events_df['eventid'] = events_df.index + 1
In [19]:
venues_df
Out[19]:
Venue City Venue Country venueid
0 Petco Park San Diego United States 1
1 3 Dollar Bill Brooklyn United States 2
2 Adventure Hornblower San Diego United States 3
3 Ritoque Maipú Chile 4
4 INCENDIA Tulum Mexico 5
... ... ... ... ...
399 Playa Avellana Los Pargos Costa Rica 400
400 Black Circle Indianapolis United States 401
401 The Sterling Aspen United States 402
402 Vina Del Mar Viña del Mar Chile 403
403 Private Residence SF San Francisco United States 404

404 rows × 4 columns

Adding AristGenre columns to the artists_df¶

The next step is to count the genres of each artist's most-recent songs on Beatport (electronic music's premier digital music website). Each artist's top three genres will be recorded in columns artistgenre1, artistgenre2, and artistgenre3 in the artists_df dataframe.

In [20]:
# Path to updated artists_df with genre columns added:
gigwell_file_path = "/Users/hydraulicdecoy/Desktop/Operative Analytics/Tables as of October 2024/artists_df_withgenres.csv"

# Load CSV file into DataFrame
artists_df = pd.read_csv(gigwell_file_path)
In [21]:
artists_df
Out[21]:
Artists artistid artistgenre1 artistgenre2 artistgenre3
0 AceMoMa 1 Techno (Peak Time / Driving) Breaks / Breakbeat / UK Bass Drum & Bass
1 Yulia Niko 2 Melodic House & Techno Afro House Indie Dance
2 Jamie Jones 3 Tech House House Minimal / Deep Tech
3 Ash Lauryn 4 House Deep House Dance / Pop
4 Hot Since 82 5 House Tech House Melodic House & Techno
... ... ... ... ... ...
90 Desert Hearts 91 Tech House Jackin House House
91 Juan Hansen 92 Melodic House & Techno Afro House Deep House
92 JOPLYN 93 Melodic House & Techno Electronica House
93 Arodes 94 Melodic House & Techno Progressive House Afro House
94 Calussa 95 Tech House Afro House Melodic House & Techno

95 rows × 5 columns

Adding PromoterGenre columns to the promoters_df¶

Now, we need to count the genres of music of the artists each promoter books. Then, we compile the top three genres each promoter books, and record them in columns promotergenre1, promotergenre2, and promotergenre3 in the promoters_df dataframe.

In [22]:
# Step 1: Merge events_df with artists_df to bring in the artist genres
events_with_genres = events_df.merge(artists_df[['artistid', 'artistgenre1', 'artistgenre2', 'artistgenre3']], 
                                     on='artistid', how='left')
In [23]:
events_with_genres
Out[23]:
Event Date Artists artistid Venue venueid City Venue Country Promoter promoterid eventid artistgenre1 artistgenre2 artistgenre3
0 1/1/23 Mary Droppinz 1 Petco Park 1 San Diego United States CRSSD Music, LLC dba FNGRS CRSSD 1 1 Techno (Peak Time / Driving) Breaks / Breakbeat / UK Bass Drum & Bass
1 1/1/23 Guy J 2 3 Dollar Bill 2 Brooklyn United States Inoki Management LLC 2 2 Melodic House & Techno Afro House Indie Dance
2 1/1/23 Ardalan 3 Adventure Hornblower 3 San Diego United States Music is 4 Lovers, LLC 3 3 Tech House House Minimal / Deep Tech
3 1/1/23 Kevin De Vries 4 Petco Park 1 San Diego United States CRSSD Music, LLC 4 4 House Deep House Dance / Pop
4 1/1/23 Matador 5 Ritoque 4 Maipú Chile Inversiones m y c 5 5 House Tech House Melodic House & Techno
... ... ... ... ... ... ... ... ... ... ... ... ... ...
833 2/23/25 Victor Ruiz 14 Autódromo Hermanos Rodríguez 39 Iztacalco Mexico OCESA PROMOTORA SA DE CV (TAX ID: DEE-031112-SI7) 374 834 House Tech House Minimal / Deep Tech
834 3/1/25 PAN-POT 11 Knockdown Center 28 Queens United States DMI, Inc DBA Knockdown Center 27 835 House Tech House Breaks / Breakbeat / UK Bass
835 3/14/25 Julya Karma 21 Audio Nightclub 87 San Francisco United States Audio Nightclub 151 836 Minimal / Deep Tech Tech House House
836 3/15/25 Rodriguez Jr. 7 Majahuitas 221 Puerto Vallarta Mexico VALLARTA ADVENTURE SA DE CV 208 837 Progressive House Melodic House & Techno Deep House
837 5/24/25 Rodriguez Jr. 7 Private Residence SF 404 San Francisco United States Jeremy Ringel 375 838 Progressive House Melodic House & Techno Deep House

838 rows × 13 columns

In [24]:
# Step 2: Reshape the data to have promoterid and all genres in a single column
# We will stack artistgenre1, artistgenre2, artistgenre3 into a single genre column.
genres_long = pd.melt(events_with_genres, 
                      id_vars=['promoterid'], 
                      value_vars=['artistgenre1', 'artistgenre2', 'artistgenre3'], 
                      var_name='genre_type', value_name='genre')
In [25]:
genres_long
Out[25]:
promoterid genre_type genre
0 1 artistgenre1 Techno (Peak Time / Driving)
1 2 artistgenre1 Melodic House & Techno
2 3 artistgenre1 Tech House
3 4 artistgenre1 House
4 5 artistgenre1 House
... ... ... ...
2509 374 artistgenre3 Minimal / Deep Tech
2510 27 artistgenre3 Breaks / Breakbeat / UK Bass
2511 151 artistgenre3 House
2512 208 artistgenre3 Deep House
2513 375 artistgenre3 Deep House

2514 rows × 3 columns

In [26]:
# Step 3: Count the occurrences of each genre for each promoter
genre_counts = genres_long.groupby(['promoterid', 'genre']).size().reset_index(name='count')
In [27]:
genre_counts
Out[27]:
promoterid genre count
0 1 Breaks / Breakbeat / UK Bass 1
1 1 Drum & Bass 1
2 1 Techno (Peak Time / Driving) 1
3 2 Afro House 2
4 2 Indie Dance 2
... ... ... ...
1802 374 Minimal / Deep Tech 1
1803 374 Tech House 1
1804 375 Deep House 1
1805 375 Melodic House & Techno 1
1806 375 Progressive House 1

1807 rows × 3 columns

In [28]:
# Step 4: Rank the genres for each promoter by the count of shared events
genre_counts['rank'] = genre_counts.groupby('promoterid')['count'].rank(method='first', ascending=False)
In [29]:
genre_counts
Out[29]:
promoterid genre count rank
0 1 Breaks / Breakbeat / UK Bass 1 1.0
1 1 Drum & Bass 1 2.0
2 1 Techno (Peak Time / Driving) 1 3.0
3 2 Afro House 2 1.0
4 2 Indie Dance 2 2.0
... ... ... ... ...
1802 374 Minimal / Deep Tech 1 2.0
1803 374 Tech House 1 3.0
1804 375 Deep House 1 1.0
1805 375 Melodic House & Techno 1 2.0
1806 375 Progressive House 1 3.0

1807 rows × 4 columns

In [30]:
# Step 5: Pivot the data to get the top 3 genres for each promoter
top_genres = genre_counts.pivot_table(index='promoterid', 
                                      columns='rank', 
                                      values='genre', 
                                      aggfunc='first').reset_index()
In [31]:
top_genres
Out[31]:
rank promoterid 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0
0 1 Breaks / Breakbeat / UK Bass Drum & Bass Techno (Peak Time / Driving) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 Afro House Indie Dance Melodic House & Techno NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 Deep House Melodic House & Techno Progressive House House Minimal / Deep Tech Tech House NaN NaN NaN NaN NaN NaN NaN NaN
3 4 Dance / Pop Deep House House NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 House Tech House Breaks / Breakbeat / UK Bass Melodic House & Techno NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
370 371 Deep House Melodic House & Techno Progressive House NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
371 372 House Tech House Techno (Peak Time / Driving) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
372 373 House Melodic House & Techno Tech House NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
373 374 House Minimal / Deep Tech Tech House NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
374 375 Deep House Melodic House & Techno Progressive House NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

375 rows × 15 columns

In [32]:
# Step 5b: Drop any columns with rank greater than 3 (e.g., "4.0" and beyond)
# Filter out columns '4.0' and above, keep only promoterid and ranks 1, 2, 3
columns_to_keep = ['promoterid', 1.0, 2.0, 3.0]  
top_genres = top_genres[columns_to_keep]
In [33]:
# Step 5c: Rename the columns to promotergenre1, promotergenre2, promotergenre3
top_genres.columns = ['promoterid', 'promotergenre1', 'promotergenre2', 'promotergenre3']
In [34]:
top_genres
Out[34]:
promoterid promotergenre1 promotergenre2 promotergenre3
0 1 Breaks / Breakbeat / UK Bass Drum & Bass Techno (Peak Time / Driving)
1 2 Afro House Indie Dance Melodic House & Techno
2 3 Deep House Melodic House & Techno Progressive House
3 4 Dance / Pop Deep House House
4 5 House Tech House Breaks / Breakbeat / UK Bass
... ... ... ... ...
370 371 Deep House Melodic House & Techno Progressive House
371 372 House Tech House Techno (Peak Time / Driving)
372 373 House Melodic House & Techno Tech House
373 374 House Minimal / Deep Tech Tech House
374 375 Deep House Melodic House & Techno Progressive House

375 rows × 4 columns

In [35]:
# Step 6: Merge the ranked genres back into promoters_df
promoters_df = promoters_df.merge(top_genres, on='promoterid', how='left')
In [36]:
promoters_df
Out[36]:
Promoter Signer Signer Email promoterid active promotergenre1 promotergenre2 promotergenre3
0 CRSSD Music, LLC dba FNGRS CRSSD Kevin Wiles kevin@crssd.com 1 True Breaks / Breakbeat / UK Bass Drum & Bass Techno (Peak Time / Driving)
1 Inoki Management LLC Boris N borisnotthedj@gmail.com 2 True Afro House Indie Dance Melodic House & Techno
2 Music is 4 Lovers, LLC Jimbo James jimbo@lovelifeparty.com 3 True Deep House Melodic House & Techno Progressive House
3 CRSSD Music, LLC Kevin Wiles sophie@crssd.com 4 True Dance / Pop Deep House House
4 Inversiones m y c Claudio Agosin claudio.agosin@gmail.com 5 True House Tech House Breaks / Breakbeat / UK Bass
... ... ... ... ... ... ... ... ...
370 The BPM Festival Vito Tomasicchio sabrina@thebpmfestival.com 371 True Deep House Melodic House & Techno Progressive House
371 DBA Progression Indy Trung Ngo minhtrung1811@gmail.com 372 True House Tech House Techno (Peak Time / Driving)
372 VANGUARD PRODUCTION S.R.L. Lucas Figoli cesar_osorio@me.com 373 True House Melodic House & Techno Tech House
373 OCESA PROMOTORA SA DE CV (TAX ID: DEE-031112-SI7) MICHELLE TREJO mtrejoa@ocesa.mx 374 True House Minimal / Deep Tech Tech House
374 Jeremy Ringel Jeremy Ringel ringelj6@yahoo.com 375 True Deep House Melodic House & Techno Progressive House

375 rows × 8 columns

Now all four tables are in the correct dimensions to be loaded into the postgreSQL database.¶

In [40]:
venues_df.to_csv('venues_df.csv', index=False)
artists_df.to_csv('artists_df.csv', index=False)
promoters_df.to_csv('promoters_df.csv', index=False)
events_df.to_csv('events_df.csv', index=False)

Step 3. LOAD¶

In [47]:
loadcode = """
-- Create Artists table
CREATE TABLE artists_df (
    Artists VARCHAR,
    artistid SERIAL PRIMARY KEY,
	artistgenre1 VARCHAR,
    artistgenre2 VARCHAR,
    artistgenre3 VARCHAR
);

-- Create Venues table
CREATE TABLE venues_df (
    Venue VARCHAR,
	City VARCHAR,
    VenueCountry VARCHAR,
    venueid SERIAL PRIMARY KEY
);

-- Create Promoters table
CREATE TABLE promoters_df (
	Promoter VARCHAR,
    Signer VARCHAR,
    SignerEmail VARCHAR,
	promoterid SERIAL PRIMARY KEY,
    active BOOLEAN,
    promotergenre1 VARCHAR,
    promotergenre2 VARCHAR,
    promotergenre3 VARCHAR
);

-- Create Events table
CREATE TABLE events_df (
    EventDate VARCHAR,
    eventid SERIAL PRIMARY KEY,
    artistname VARCHAR,
	artistid INTEGER REFERENCES artists(artistid),
	venuename VARCHAR,
    venueid INTEGER REFERENCES venues(venueid),
	city VARCHAR,
	promotername VARCHAR,
    promoterid INTEGER REFERENCES promoters(promoterid)
);
"""

Now the latest booking data has been loaded into postgreSQL, and the database is ready for queries.¶